Halo: a hardware-agnostic accelerator orchestration software framework for heterogeneous computing systems

ABSTRACT

Hardware-agnostic accelerator orchestration (HALO) provides a software framework for heterogeneous computing systems. Hardware-agnostic programming with high performance portability is envisioned to be a bedrock for realizing adoption of emerging accelerator technologies in heterogeneous computing systems, such as high-performance computing (HPC) systems, data center computing systems, and edge computing systems. The adoption of emerging accelerators is key to achieving greater scale and performance in heterogeneous computing systems. Accordingly, embodiments described herein provide a flexible hardware-agnostic environment that allows application developers to develop high-performance applications without knowledge of the underlying hardware.

RELATED APPLICATIONS

This application claims the benefit of provisional patent application Ser. No. 62/983,220, filed Feb. 28, 2020, the disclosure of which is hereby incorporated herein by reference in its entirety.

The present application is related to concurrently filed U.S. patent application Ser. No. ______ filed on ______ entitled “C²MPI: A Hardware-Agnostic Message Passing Interface for Heterogeneous Computing Systems,” U.S. patent application Ser. No. ______ filed on ______ entitled “A Stand-Alone Accelerator Protocol (SAP) for Heterogeneous Computing Systems,” and U.S. patent application Ser. No. ______ filed on ______ entitled “A Software-Defined Board Support Package (SW-BSP) for Stand-Alone Reconfigurable Accelerators,” the disclosures of which are hereby incorporated herein by reference in their entireties.

FIELD OF THE DISCLOSURE

The present disclosure is generally related to heterogeneous computing systems, such as large-scale high-performance computing systems, data center computing systems, or edge computing systems which include specialized hardware.

BACKGROUND

Today's high-performance computing (HPC) systems are largely structured based on traditional central processing units (CPUs) with tightly coupled general-purpose graphics processing units (GPUs, which can be considered domain-specific accelerators). GPUs have a different programming model than CPUs and are only efficient in exploiting spatial parallelism for accelerating high-concurrency algorithms but not the temporal/pipeline parallelism vital to accelerating high-dependency algorithms that are widely used in predictive simulations for computational science. As a result, today's HPC systems still have huge room for improvement in terms of performance and energy efficiency for running complex scientific computing tasks (e.g., many large pieces of legacy HPC codes for predictive simulations are still running on CPUs).

In recent years, a few more accelerator choices for heterogeneous computing systems (e.g., HPC and other large-scale computing systems) have emerged, such as field-programmable gate arrays (FPGAs, which can be considered reconfigurable accelerators) and tensor processing units (TPUs, which can be considered application-specific accelerators). Although these new accelerators offer flexible or customized hardware architectures with excellent capabilities for exploiting temporal/pipeline parallelism efficiently, their adoption in extreme-scale scientific computing is still in its infancy and is expected to be a tortuous process (as was adoption of GPUs) regardless of their superior performance and energy efficiency benefits.

FIG. 1 is a diagram illustrating divergent execution flows of hardware-optimized application codes in existing HPC systems. The fundamental challenge to the adoption of any new accelerators in HPC, such as FPGAs and TPUs, is that each accelerator's programming model, message passing interface, and virtualization stack is developed independently and is specific to the respective hardware architecture. With the lack of clarity in the demarcation between hardware-specific and hardware-agnostic development regions, today's programming models require domain-matter experts (DMEs) and hardware-matter experts (HMEs) to work interdependently to make a significant effort in optimizing hardware-specific codes in order to adopt new accelerator devices in HPC and gain performance benefits. This tangled association is a self-imposed bottleneck from existing programming models that impairs a future in true heterogeneous HPC and severely impacts the velocity of scientific discovery.

SUMMARY

Hardware-agnostic accelerator orchestration (HALO) provides a software framework for heterogeneous computing systems. Hardware-agnostic programming with high performance portability is envisioned to be a bedrock for realizing adoption of emerging accelerator technologies in heterogeneous computing systems, such as high-performance computing (HPC) systems, data center computing systems, and edge computing systems. The adoption of emerging accelerators is key to achieving greater scale and performance in heterogeneous computing systems. Accordingly, embodiments described herein provide a flexible hardware-agnostic environment that allows application developers to develop high-performance applications without knowledge of the underlying hardware.

HALO is presented as an open-ended extensible multi-agent software framework that implements a set of proposed hardware-agnostic principles and a novel compute-centric message passing interface (C²MPI) specification for enabling the portable and performance-optimized execution of hardware-agnostic application host codes across heterogeneous accelerator resources. The platform developed herein provides hardware-agnostic virtualization, routing, and arbitration layers, as well as hardware-centric partitioning and a scaling layer. Most importantly, the platform allows for new hardware accelerators to be plug-and-playable for application acceleration across any network infrastructure.

An exemplary embodiment provides a HALO framework in a heterogeneous computing system. The HALO framework includes a runtime agent configured to implement an application interface for receiving a hardware-agnostic request to execute a computational function; and a virtualization agent configured to implement an accelerator interface for providing a hardware-specific instruction for a hardware accelerator to execute the computational function.

Another exemplary embodiment provides a method for providing HALO in a heterogeneous computing system. The method includes receiving a hardware-agnostic code comprising a request to execute a computational function; determining that a hardware accelerator is available to execute the computational function; and interfacing with the hardware accelerator to execute the computational function.

Another exemplary embodiment provides a non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor, cause the processor to: launch a HALO framework comprising: a runtime agent implementing an application interface to receive a hardware-agnostic request to execute a computational function; and a virtualization agent implementing an accelerator interface to provide a hardware-specific instruction for a hardware accelerator to execute the computational function.

Those skilled in the art will appreciate the scope of the present disclosure and realize additional aspects thereof after reading the following detailed description of the preferred embodiments in association with the accompanying drawing figures.

BRIEF DESCRIPTION OF THE DRAWING FIGURES

The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.

FIG. 1 is a diagram illustrating divergent execution flows of hardware-optimized application codes in existing high-performance computing (HPC) systems.

FIG. 2 is a diagram illustrating unified execution flow of hardware-agnostic application codes provided by embodiments described herein.

FIG. 3 is a schematic diagram of an exemplary heterogeneous computing system according to embodiments described herein.

FIG. 4 illustrates an exemplary interoperability protocol format for virtualization agents in a hardware-agnostic accelerator orchestration (HALO) framework.

FIG. 5 is a software stack diagram of a dual-agent embodiment of the HALO framework.

FIG. 6 is a unified modeling language diagram of the HALO embodiment of FIG. 5.

FIG. 7 is a more detailed software stack diagram of a runtime agent and a virtualization agent in the HALO embodiment of FIG. 5.

FIG. 8 is a software stack diagram of a multi-agent embodiment of the HALO framework.

FIG. 9 is a software stack diagram of the runtime agent, illustrating an interface between the runtime agent and the application in the HALO embodiment of FIG. 8.

FIG. 10 is a software stack diagram of the bridge agent in the HALO embodiment of FIG. 8.

FIG. 11 is a software stack diagram of the accelerator agent in the HALO embodiment of FIG. 8.

FIG. 12 is a more detailed software stack diagram of the virtualization agent in the HALO embodiment of FIG. 8.

FIG. 13 is a diagram illustrating interoperability among a plurality of nodes implementing HALO.

FIG. 14 illustrates a template of a host application source code (e.g., an application data steering program) of a HALO implementation used for evaluation.

FIG. 15 is a flow diagram of a method for providing HALO in a heterogeneous computing system.

FIG. 16 is a block diagram of a computer system using a HALO according to embodiments disclosed herein.

DETAILED DESCRIPTION

The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

It will be understood that when an element such as a layer, region, or substrate is referred to as being “on” or extending “onto” another element, it can be directly on or extend directly onto the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly on” or extending “directly onto” another element, there are no intervening elements present. Likewise, it will be understood that when an element such as a layer, region, or substrate is referred to as being “over” or extending “over” another element, it can be directly over or extend directly over the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly over” or extending “directly over” another element, there are no intervening elements present. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present.

Relative terms such as “below” or “above” or “upper” or “lower” or “horizontal” or “vertical” may be used herein to describe a relationship of one element, layer, or region to another element, layer, or region as illustrated in the Figures. It will be understood that these terms and those discussed above are intended to encompass different orientations of the device in addition to the orientation depicted in the Figures.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including” when used herein specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Hardware-agnostic accelerator orchestration (HALO) provides a software framework for heterogeneous computing systems. Hardware-agnostic programming with high performance portability is envisioned to be a bedrock for realizing adoption of emerging accelerator technologies in heterogeneous computing systems, such as high-performance computing (HPC) systems, data center computing systems, and edge computing systems. The adoption of emerging accelerators is key to achieving greater scale and performance in heterogeneous computing systems. Accordingly, embodiments described herein provide a flexible hardware-agnostic environment that allows application developers to develop high-performance applications without knowledge of the underlying hardware.

HALO is presented as an open-ended extensible multi-agent software framework that implements a set of proposed hardware-agnostic principles and a novel compute-centric message passing interface (C²MPI) specification for enabling the portable and performance-optimized execution of hardware-agnostic application host codes across heterogeneous accelerator resources. The platform developed herein provides hardware-agnostic virtualization, routing, and arbitration layers, as well as hardware-centric partitioning and a scaling layer. Most importantly, the platform allows for new hardware accelerators to be plug-and-playable for application acceleration across any network infrastructure. This platform facilitates dynamic plugin of an accelerator onto the network fabric, which can be auto-discovered and utilized by applications.

I. INTRODUCTION

High-performance computing (HPC) and other large-scale computing system applications have become increasingly complex in recent years. Predictive simulations with increasingly higher spatial and temporal resolutions and ever-growing degrees of freedom are the critical drivers for achieving scientific breakthroughs. The latest advancements in deep learning paired with the next-generation scientific computing applications will inevitably demand orders of magnitude more compute power from future computing infrastructure. In the concluding days of Moore's law, general-purpose solutions will no longer be viable for meeting the exponential growth in performance that is required to keep pace with scientific innovations. This disclosure envisions that extreme-scale heterogeneous computing systems (e.g., HPC systems, data center computing systems, edge computing systems) that massively integrate various domain- and application-specific accelerators will be a viable blueprint for providing the necessary performance and energy efficiency to meet the challenges of future applications.

However, as described with respect to FIG. 1, the path to realizing extreme-scale heterogeneous computing systems is tortuous. The main obstacle to the proliferation of heterogeneous accelerators is the lack of a flexible hardware-agnostic programming model that separates the hardware-specific and performance-critical portion of an application from its logic flow. Without this separation, the execution flows of domain matter experts (DMEs) and hardware matter experts (HMEs) diverge. As a result, HPC and other large-scale computing system applications are by no means extensible with regard to new accelerator hardware.

This disclosure envisions that hardware-agnostic programming with high-performance portability will be the bedrock for realizing the pervasive adoption of emerging accelerator technologies in future heterogeneous computing systems. The proposed approach includes hardware-agnostic programming paired with a programming model that enables application developers, scientists, and other users to focus on conditioning and steering data to and from hardware-specific code without any assumption of the underlying hardware. Data conditioning and steering refers to the reorganization and movement of data within an application.

Performance portability is defined here in the strictest sense as the ability of the host code to maintain a single hardware-agnostic control flow, as well as state-of-the-art kernel performance, regardless of platform and/or scale. It also includes the ability to dynamically handle various accelerators without recompilation of the host code. This is in stark contrast to the current definition, which allows for multiple control flows and recompilation processes.

FIG. 2 is a diagram illustrating unified execution flow of hardware-agnostic application codes provided by embodiments described herein.

Embodiments described herein deploy a set of hardware-agnostic principles to enable host code hardware-agnostic programming with true performance portability in the context of heterogeneous computing systems. The proposed hardware-agnostic principles impose a clear demarcation between hardware-specific and hardware-agnostic software development to allow DMEs and HMEs to work independently in completely decoupled development regions to significantly improve practicability, productivity, and efficiency.

To accomplish this, DMEs are restricted to conditioning and steering (orchestrating) data in and out of functional abstractions of hardware-optimized kernels. The hardware-agnostic abstraction of kernels in this regard can be defined by a label and its inputs, outputs, and state variables. Such a functional approach to hardware-agnostic programming is the key to the clear division between the responsibility of DMEs and HMEs. As a result, HMEs will focus on optimizing hardware-specific kernel implementations in their optimal programming environments while being able to eliminate the adoption barrier by leveraging a hardware-agnostic environment via a unified hardware-agnostic accelerator interface. Furthermore, DMEs will focus on application or algorithm development while being able to maintain a single code flow and effortlessly reap the performance benefits of new hardware accelerators by leveraging the hardware-agnostic environment via a unified hardware-agnostic application interface.
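As an illustration of the functional abstraction described above, the following is a minimal sketch in C, assuming a kernel can be summarized by its label and the number of inputs, outputs, and state variables. The type and function names (kernel_abstraction, kernel_abstraction_equal) are hypothetical and are not part of HALO or C²MPI.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: the host code knows a kernel only by its label
 * and by how many inputs, outputs, and state variables it exposes. */
typedef struct {
    const char *label;     /* hardware-agnostic kernel name */
    size_t      n_inputs;
    size_t      n_outputs;
    size_t      n_state;   /* variables persisted across invocations */
} kernel_abstraction;

/* Two abstractions are interchangeable for the host code when the label and
 * the counts match, regardless of which hardware implements the kernel. */
static int kernel_abstraction_equal(const kernel_abstraction *a,
                                    const kernel_abstraction *b)
{
    return strcmp(a->label, b->label) == 0 &&
           a->n_inputs == b->n_inputs &&
           a->n_outputs == b->n_outputs &&
           a->n_state == b->n_state;
}
```

Under this assumption, an HME may replace the implementation behind a label without the DME changing host code, as long as the abstraction compares equal.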

A. Heterogeneous Computing System

FIG. 3 is a schematic diagram of an exemplary heterogeneous computing system 10 according to embodiments described herein. The heterogeneous computing system 10 provides hardware-agnostic system environments through a C²MPI interface 12 with an application 14, a HALO framework 16, a stand-alone accelerator protocol (SAP) 18, and a software-defined board support package (SW-BSP) 20 for reconfigurable accelerators. These technologies provide a multi-stack software framework that includes stacks of core system agents with a modular architecture for hardware-agnostic programming and transparent execution with performance portability, interoperability, scalability, and resiliency across an extreme-scale heterogeneous computing system.

The communication message passing interface of various accelerators is unified for the heterogeneous computing system 10 by implementing a novel C²MPI standard described herein. The C²MPI interface 12 extends concepts from the existing message passing interface (MPI) standard and introduces new concepts of computation-centric communication to support interoperable program execution and communication across meshes of heterogeneous accelerators by leveraging remote procedure calls (RPCs). In C²MPI, a central processing unit (CPU) and an accelerator process are treated uniformly by attaching a computation function attribute to each process to allow run-time programs to easily distribute data to various function-specific accelerator processes for highly scalable acceleration.

The HALO framework 16 is provided as an open-ended extensible multi-agent software framework that implements the proposed hardware-agnostic principles and C²MPI specification for enabling the portable and performance-optimized execution of hardware-agnostic application codes across heterogeneous computing devices. Dual-agent embodiments of the HALO framework 16 include two system agents, i.e., a runtime agent and a virtualization agent, which work asynchronously in a star topology. The runtime agent is responsible for implementing and offering the C²MPI interface 12, as well as being the crossbar switch for application processes and virtualization agents. The runtime agent also manages system resources, including device buffers, accelerator manifests, kernels, etc. The virtualization agent provides an asynchronous peer that encapsulates hardware-specific compilers, libraries, runtimes, and drivers. The runtime and virtualization agents implement common inter-process communication (IPC) channels for interoperability between multiple virtualization agents, which allows HALO to scale the number of accelerator types supported while maintaining the simplicity and structure of the framework.

Multi-agent embodiments of the HALO framework 16 consist of a set of core system agents (i.e., runtime agent, bridge agent, accelerator agent, virtualization agent) implementing a plug-and-play architecture for the purpose of scale and resiliency. The runtime agent and virtualization agent operate similarly to those of the dual-agent embodiments. The purpose of the bridge agent is to interoperate between the CPU and accelerator domains. The primary responsibility of the accelerator agent is to interconnect the entire accelerator domain and provide interoperability among heterogeneous accelerators across multiple nodes.

SAP 18 provides a new architectural standard for scalable stand-alone accelerators, facilitating implementation of large clusters of stand-alone accelerators via a network 22. Reconfigurable accelerators can implement the SAP using SW-BSP 20.

B. Definitions

The following terms are defined for clarity:

Programming Model: The programming methodology used to construct a unified execution flow at the host code level using HALO.

Portability: The ability of a host program (code) to remain unmodified and operational regardless of the underlying hardware.

Performance Portability: The ability of an application to maintain a high performance relative to the respective hardware-optimized implementations across different computing devices without needing to modify the host code or redeploy binaries.

Agent: An asynchronous process running on an operating system that takes its input from an IPC channel (i.e., a forked process).

Micro-Framework: The software framework encapsulating an external runtime, library, and/or software framework that is used as the final stage of the virtualization agent to communicate beyond the HALO boundaries.

DME: An application developer or scientist that focuses on conditioning and steering data for pre-defined processing and analytics for scientific discovery.

HME: An optimization and/or hardware expert that focuses on developing performance-critical, hardware-optimized device kernels for the data processing and analytics subroutines needed by DMEs.
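The definition of an agent given above (an asynchronous, forked process that takes its input from an IPC channel) can be sketched with POSIX pipes. This is an illustrative sketch only; the helper name demo_agent_round_trip and the choice of replying with the number of bytes received are assumptions, and the disclosure does not specify which IPC mechanism HALO agents use.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks an agent process, sends it one short request
 * over a pipe, and returns the agent's reply (the number of bytes the agent
 * received). Returns -1 on failure or if the request is 256 bytes or more. */
static long demo_agent_round_trip(const char *request)
{
    int to_agent[2], from_agent[2];
    size_t len = strlen(request);

    if (len >= 256)
        return -1;
    if (pipe(to_agent) != 0)
        return -1;
    if (pipe(from_agent) != 0) {
        close(to_agent[0]);
        close(to_agent[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_agent[0]);
        close(to_agent[1]);
        close(from_agent[0]);
        close(from_agent[1]);
        return -1;
    }

    if (pid == 0) {
        /* Agent side: block on the IPC channel until a request arrives. */
        char buf[256];
        close(to_agent[1]);
        close(from_agent[0]);
        long n = (long)read(to_agent[0], buf, sizeof buf);
        int ok = write(from_agent[1], &n, sizeof n) == (ssize_t)sizeof n;
        _exit(ok ? 0 : 1);
    }

    /* Application side: issue the request, then collect the reply. */
    close(to_agent[0]);
    close(from_agent[1]);
    int sent = write(to_agent[1], request, len) == (ssize_t)len;
    close(to_agent[1]); /* signals end of input to the agent */
    long reply = -1;
    int received =
        read(from_agent[0], &reply, sizeof reply) == (ssize_t)sizeof reply;
    close(from_agent[0]);
    waitpid(pid, NULL, 0);
    return (sent && received) ? reply : -1;
}
```

A production agent would presumably loop over many requests rather than exit after one; the single round trip keeps the sketch short.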

C. C²MPI Specification

The proposed C²MPI specification extends the capabilities of the existing message passing interface (MPI) from communication libraries to a hardware-agnostic programming model. C²MPI introduces a heterogeneous ranking system and a kernel execution model that will enable developers to claim and invoke accelerator resources as an abstracted function-specific subroutine. Some embodiments of C²MPI are designed as an extension of the legacy MPI to simplify and ease the adoption of HALO into existing MPI-enabled applications and minimize the learning curve for developers. In other embodiments, other data interfaces may be similarly adapted to facilitate hardware-agnostic programming. C²MPI seizes on the notion of ranks and introduces heterogeneous ranks to represent accelerator resources. Leveraging C²MPI, HALO inherits the coherency, synchronization, and caching semantics from the legacy MPI.

With legacy MPI in mind, C²MPI unifies communication and computation orchestration between accelerators and general-purpose CPUs through a heterogeneous parent-child ranking system that describes all computation resources as ranks. Parent ranks can allocate and manage child ranks. The C²MPI specification is defined by two sub-specifications: one for application parent processes and the other for the accelerator parent processes.

With continuing reference to FIG. 3, application parent ranks live inside the hardware-agnostic region of the application 14. Application parent ranks are not guaranteed to be performance-portable. Application parent ranks are synonymous with traditional MPI ranks. They have the full capabilities of a typical MPI-based rank process along with the added capabilities for child rank management. Both application and accelerator processes can allocate and manage child ranks.

Accelerator parent ranks, in addition to managing their own child ranks, have the added responsibilities of hardware management, kernel retrieval, registration, and execution and maintaining system resources allocated by application parent ranks. Similar to MPI-based applications, jobs can instantiate multiple application parent ranks, and each parent rank can be multi-threaded, making requests into the child management system asynchronously. Therefore, the C²MPI interfaces are thread-safe. In some embodiments, C²MPI does not allow system resources to be shared across the boundary of parent ranks. In other embodiments, C²MPI is extended to enable resource sharing across parent ranks.

Child ranks are the virtual abstraction of a system resource in the form of an opaque handle, similar to a parent rank, but with limited capabilities. Such a system resource is not inherently tied to any physical resource at runtime, and the runtime agent of the HALO framework 16 has full authority to move both functionality and allocation to compatible accelerators on the network while assuring computation integrity. Child ranks can be allocated via an application or accelerator parent rank, with both having the lifespan of the job issuing requests. Child ranks can represent a single resource or a set of resources in parallel or pipeline. A pipeline of resources is a series of dependent kernel invocations. Child ranks can be deallocated via C²MPI interfaces, and the resources are freed when MPIX_Finalize gets executed.
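The child rank life cycle described above can be sketched as a small handle table owned by a parent rank. The sketch assumes an integer handle and a fixed-size table; every name prefixed with demo_ is hypothetical and does not reproduce the C²MPI interface. It shows that the handle stays stable when the runtime moves the allocation to another device, and that any child ranks still allocated are released at finalization.

```c
#include <stddef.h>

#define DEMO_MAX_CHILDREN 8

typedef int demo_child_rank; /* opaque handle handed to the application */

typedef struct {
    int in_use[DEMO_MAX_CHILDREN];
    int device[DEMO_MAX_CHILDREN]; /* current placement, hidden from callers */
} demo_parent_rank;

static void demo_parent_init(demo_parent_rank *p)
{
    for (size_t i = 0; i < DEMO_MAX_CHILDREN; ++i) {
        p->in_use[i] = 0;
        p->device[i] = -1;
    }
}

static int demo_child_valid(const demo_parent_rank *p, demo_child_rank h)
{
    return h >= 0 && h < DEMO_MAX_CHILDREN && p->in_use[h];
}

/* Allocates a child rank on the given device; returns -1 when full. */
static demo_child_rank demo_child_alloc(demo_parent_rank *p, int device)
{
    for (int i = 0; i < DEMO_MAX_CHILDREN; ++i) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            p->device[i] = device;
            return i;
        }
    }
    return -1;
}

/* The runtime may move the allocation; the handle itself does not change. */
static int demo_child_migrate(demo_parent_rank *p, demo_child_rank h, int dev)
{
    if (!demo_child_valid(p, h))
        return -1;
    p->device[h] = dev;
    return 0;
}

static int demo_child_free(demo_parent_rank *p, demo_child_rank h)
{
    if (!demo_child_valid(p, h))
        return -1;
    p->in_use[h] = 0;
    p->device[h] = -1;
    return 0;
}

/* Releases every remaining child rank; returns how many were released. */
static int demo_parent_finalize(demo_parent_rank *p)
{
    int released = 0;
    for (int i = 0; i < DEMO_MAX_CHILDREN; ++i)
        if (demo_child_free(p, i) == 0)
            ++released;
    return released;
}
```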

Child ranks have attributes that can be defined statically from a configuration file or dynamically at runtime, wherein developers can create an alias for resources and specify kernels and optional executor via a configuration file defining kernel attributes. Child rank management (e.g., via the runtime agent of the HALO framework 16) uses these attributes to allocate resources according to the resource alias.
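The disclosure does not give the layout of the configuration file, so the following sketch assumes a purely hypothetical one-line form, "alias = kernel @ executor", in which the executor part is optional. It shows how a resource alias, a kernel, and an optional executor might be read into a child rank attribute record; demo_child_attr and demo_parse_attr are illustrative names only.

```c
#include <stdio.h>

#define DEMO_NAME_MAX 32

typedef struct {
    char alias[DEMO_NAME_MAX];
    char kernel[DEMO_NAME_MAX];
    char executor[DEMO_NAME_MAX]; /* empty string when not specified */
} demo_child_attr;

/* Parses one line of the assumed form "alias = kernel [@ executor]".
 * Returns 0 on success and -1 when the alias or the kernel is missing. */
static int demo_parse_attr(const char *line, demo_child_attr *out)
{
    out->alias[0] = out->kernel[0] = out->executor[0] = '\0';

    /* Each field is limited to 31 characters plus the terminator. */
    int n = sscanf(line, " %31[^= \t] = %31[^@ \t\n] @ %31s",
                   out->alias, out->kernel, out->executor);
    if (n < 2)
        return -1;
    if (n == 2)
        out->executor[0] = '\0'; /* executor omitted: leave the choice open */
    return 0;
}
```

For example, under this assumed format the line "fft_small = fft1d @ fpga0" would yield the alias fft_small, the kernel fft1d, and the executor fpga0, while "blur=gauss3x3" would leave the executor empty.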

D. Remote Procedure Calls

Remote procedure call (RPC) is a protocol commonly used in client-server systems, where clients offload tasks to a remote entity (e.g., a server). It is an effective way to distribute tasks among remote servers and/or peers. RPCs are widely used in web browsers, cloud services (e.g., gRPC), software-as-a-service platforms, container orchestration (e.g., Kubernetes), massively distributed databases (e.g., Oracle), high-performance parallel computing offload programming models, and libraries. Typically, RPC-based software frameworks (e.g., Azure Databricks, Google AI platform, Apache swarm) provide an interface for clients to issue (payload, command) pairs and have them executed remotely. Similarly, HALO leverages the RPC protocol to encapsulate and remotely execute kernels among a network of agent software processes in a peer-to-peer manner.
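To make the notion of a payload and command pair concrete, the following sketch encodes such a pair into a byte buffer and decodes it again. The wire layout (a 4-byte big-endian command, a 4-byte big-endian payload length, then the payload) is an assumption chosen for illustration; it is not the HALO interoperability format of FIG. 4, and all demo_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t       command;     /* which remote procedure to run */
    uint32_t       payload_len; /* number of payload bytes */
    const uint8_t *payload;
} demo_rpc_msg;

static void demo_put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t demo_get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Writes header and payload into out; returns the number of bytes written,
 * or 0 when the buffer capacity is too small. */
static size_t demo_rpc_encode(const demo_rpc_msg *m, uint8_t *out, size_t cap)
{
    size_t need = 8u + (size_t)m->payload_len;
    if (cap < need)
        return 0;
    demo_put_u32(out, m->command);
    demo_put_u32(out + 4, m->payload_len);
    if (m->payload_len > 0)
        memcpy(out + 8, m->payload, m->payload_len);
    return need;
}

/* Decodes a message; the payload pointer aliases the input buffer.
 * Returns 0 on success and -1 when the buffer is truncated. */
static int demo_rpc_decode(const uint8_t *in, size_t len, demo_rpc_msg *m)
{
    if (len < 8)
        return -1;
    m->command = demo_get_u32(in);
    m->payload_len = demo_get_u32(in + 4);
    if ((size_t)m->payload_len > len - 8)
        return -1;
    m->payload = in + 8;
    return 0;
}
```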

II. HALO PRINCIPLES

The HALO principles guide the development of hardware-agnostic specifications, programming models, and frameworks. The hallmark of a hardware-agnostic system is an interface definition devoid of any vendor-specific, hardware-specific, or computational-task-specific implementations or naming conventions. Interfaces must also be domain-agnostic such that method signatures do not imply functionality but a delivery vehicle. For instance, a method called “execute(kernel1, parameter1 . . . N)” is domain-agnostic; however, “kernel1(parameter1 . . . N)” is not. Additionally, hardware-agnostic and hardware-specific regions must be clearly defined and decoupled with a robust interoperation protocol.
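The distinction between a domain-agnostic delivery vehicle and a domain-specific call can be sketched as follows. The single entry point demo_execute takes the kernel label as data, so the interface itself names no functionality. The kernels "sum" and "scale2", the registry, and the signature are illustrative assumptions, not the C²MPI interface.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical uniform kernel signature used only for this sketch. */
typedef int (*demo_kernel_fn)(const double *in, size_t n_in,
                              double *out, size_t n_out);

static int demo_sum(const double *in, size_t n_in, double *out, size_t n_out)
{
    if (n_out < 1)
        return -1;
    out[0] = 0.0;
    for (size_t i = 0; i < n_in; ++i)
        out[0] += in[i];
    return 0;
}

static int demo_scale2(const double *in, size_t n_in,
                       double *out, size_t n_out)
{
    if (n_out < n_in)
        return -1;
    for (size_t i = 0; i < n_in; ++i)
        out[i] = 2.0 * in[i];
    return 0;
}

/* Kernels are registered by label; the host code never calls them by name. */
static const struct {
    const char    *label;
    demo_kernel_fn fn;
} demo_kernels[] = {
    { "sum",    demo_sum },
    { "scale2", demo_scale2 },
};

/* Domain-agnostic entry point: the signature describes delivery only.
 * Returns the kernel's status, or -1 when the label is unknown. */
static int demo_execute(const char *label, const double *in, size_t n_in,
                        double *out, size_t n_out)
{
    size_t count = sizeof demo_kernels / sizeof demo_kernels[0];
    for (size_t i = 0; i < count; ++i)
        if (strcmp(demo_kernels[i].label, label) == 0)
            return demo_kernels[i].fn(in, n_in, out, n_out);
    return -1;
}
```

Adding a kernel in this sketch means adding a registry entry; the signature of demo_execute and every existing caller remain unchanged.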

Lastly, abstract functionality must be inclusive of procedures that operate on data and change state. Being domain-agnostic will allow for the enormous flexibility and extensibility required to maintain an open-ended HALO software architecture, where HMEs can easily extend the overall system with new accelerator devices and kernel implementations. The purposes of a hardware-agnostic programming model are twofold. The first is to minimize the amount of hardware-dependent software in a codebase while maximizing the portability of the host code across heterogeneous computing devices. The second is to clearly separate the functional (non-performance-critical) and computational (performance-critical) aspects of the application to simplify the adoption of new accelerator hardware as well as the development and integration of hardware-specific and hardware-optimized interfaces/kernels.

III. HALO DESIGN

A. Framework Concepts

1. Domain/Hardware-Agnostic Application Interface

HALO implements the application subset of C²MPI in the C language to offer a domain- and hardware-agnostic interface to DMEs (application developers, scientists) to enable performance portability. The C²MPI application interface will allow DMEs to abstract hardware details from application codes while orchestrating hardware-optimized kernels and achieving the best-in-class performance of accelerator resources without needing to modify application code at all. The C²MPI application interface allows for DMEs to harden the performance portability of application codes for both the current and future accelerator technologies.

2. Domain-Agnostic Accelerator Interface

Similar to the C²MPI application interface, HALO implements the C²MPI accelerator interface in the C language to provide a domain- and hardware-agnostic interface for HMEs with transparent interoperability between system resources and the HALO runtime and virtualization agents. Additionally, the C²MPI accelerator interface provides the specification to support distributed RPC (DRPC), reading meta-data, manifests, and other capability levels provided by remote system resources, including system performance, rank requests, and host and device heap memory allocation.

3. Agent Interoperability

FIG. 4 illustrates an exemplary interoperability protocol format for virtualization agents in the HALO framework 16. HALO enables interoperability among multiple virtualization agents, each supporting a different accelerator device and parent ranks via the runtime agent. The interoperability stems from a combination of purely asynchronous message protocols and specifications based on which an accelerator resource can leverage other system resources, such as runtime and virtualization agents and system-wide memory resources. The interoperability protocol utilizes an asynchronous request-and-response protocol allowing for multiple messages from various runtime and virtualization agents to be serviced in parallel. The accelerator interface and interoperability protocol are implemented in a pipeline (or a chain of responsibility pattern) in the virtualization agent to support code reuse and modularization.
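As a non-limiting illustration of the asynchronous request-and-response behavior described above, the following minimal sketch correlates each response with its request through an identifier, so that several requests can be in flight at once. The names `AgentLink`, `request`, and `reversing_agent` are hypothetical and are not part of the interoperability protocol of FIG. 4; Python's `asyncio` stands in for the actual transport.

```python
import asyncio
import itertools


class AgentLink:
    """Hypothetical link that matches asynchronous responses to requests by ID."""

    def __init__(self):
        self._ids = itertools.count()
        self._pending = {}   # request ID -> future awaiting the response
        self._tasks = set()  # keep references so delivery tasks are not collected

    async def request(self, agent, body):
        req_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self._pending[req_id] = future
        task = asyncio.create_task(self._deliver(agent, req_id, body))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return await future

    async def _deliver(self, agent, req_id, body):
        reply = await agent(body)
        # Resolve only the future registered for this request ID.
        self._pending.pop(req_id).set_result(reply)


async def reversing_agent(body):
    """Stand-in for a remote agent; longer bodies take longer to service."""
    await asyncio.sleep(0.01 * len(body))
    return body[::-1]


async def demo():
    link = AgentLink()
    # Both requests are serviced concurrently; gather returns replies in call order.
    return await asyncio.gather(
        link.request(reversing_agent, "abc"),
        link.request(reversing_agent, "z"),
    )
```

In this sketch the second request completes before the first, yet each caller still receives its own reply because the pending registry is keyed by request identifier.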

4. Accelerator Multi-Source Kernels Repository

The HALO framework employs a multi-source approach to enable hardware-agnostic programming. Hardware-specific kernels are placed in separate source files that are compiled/linked dynamically by the virtualization agent. The indexing of these kernels is governed by the lookup system discussed in Section III.C. The kernel specifications, such as compatible execution hardware, function signatures, and other meta-data for each kernel, are embedded and communicated to the virtualization agent dynamically or statically. The kernels have a special hardware-agnostic extension (*.ha) that comprises both the kernel specification and binaries.
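As an illustrative sketch only, the multi-source repository can be pictured as a table of kernel specifications indexed by function name and filtered by compatible execution hardware. The `KernelSpec` fields, the `KernelRepository` class, and the sample file names below are hypothetical; they do not reproduce the *.ha file layout or the lookup system of Section III.C.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KernelSpec:
    """Hypothetical kernel specification accompanying a kernel binary."""
    function: str  # hardware-agnostic function name requested by the application
    hardware: str  # execution hardware the kernel was built for
    path: str      # illustrative location of the kernel file


class KernelRepository:
    """Indexes kernel specifications by function name."""

    def __init__(self) -> None:
        self._index: Dict[str, List[KernelSpec]] = {}

    def register(self, spec: KernelSpec) -> None:
        self._index.setdefault(spec.function, []).append(spec)

    def lookup(self, function: str, hardware: str) -> Optional[KernelSpec]:
        """Return the first kernel implementing `function` for `hardware`, if any."""
        for spec in self._index.get(function, []):
            if spec.hardware == hardware:
                return spec
        return None


repo = KernelRepository()
repo.register(KernelSpec("mmm", "gpu", "mmm_gpu.ha"))
repo.register(KernelSpec("mmm", "fpga", "mmm_fpga.ha"))
```

Because the application names only the function, a new hardware-specific kernel can be registered without any change to the host code.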

5. Multi-Agent System

The structural pattern implemented by HALO is a multi-agent system. The multi-agent structural pattern is a direct response to the HALO principles for providing an open-ended extendable architecture that allows runtime and virtualization agents supporting various accelerator devices and system functionalities to be dynamically connected to and disconnected from the runtime agents without affecting hardware-agnostic applications. The plug-and-play nature of the HALO architecture, along with clear interface definitions, will allow HMEs to quickly develop, evaluate, and deploy HALO-compatible, hardware-optimized kernels.

B. Dual-Agent HALO

FIG. 5 is a software stack diagram of a dual-agent embodiment of the HALO framework 16. This dual-agent embodiment may provide the HALO framework 16 in a single-node heterogeneous computing system (e.g., with CPU-hosted accelerators). The exemplary embodiment of the HALO framework 16 is a multi-agent C/C++ software framework that provides hardware-agnostic accelerator orchestration by implementing the C²MPI specification for both application and accelerator frontends. The backend implements a C++-based multi-agent system that includes runtime agents and virtualization agents working together asynchronously.

Each virtualization agent 24 implements a different accelerator specification, while the runtime agent 26 implements both the C²MPI application interface 28 and accelerator interface 30. Topologically, the system is built on a star pattern where the runtime agent acts as a bridge between different ranks and virtualization agents.

FIG. 6 is a unified modeling language diagram of the HALO embodiment of FIG. 5 . The runtime agent 26 and virtualization agent 24 are loosely coupled and interconnected by a domain-agnostic protocol (see FIG. 4 ). These agents operate completely asynchronously with each other. The application parent ranks operate synchronously or asynchronously to and from the runtime agent 26. In order to protect against system-wide race conditions and deadlocks, synchronization points only occur at the application parent rank thread level. Therefore, if a thread in an application parent rank calls a blocking C²MPI method, the blocking mechanism will only block locally to that thread.

1. Runtime Agent

FIG. 7 is a more detailed software stack diagram of the runtime agent 26 and the virtualization agent 24 in the HALO embodiment of FIG. 5 . There is one runtime agent process for each application in progress, providing multi-tenancy support. The runtime agent 26 is a duo-thread agent representing the C²MPI application and accelerator interfaces. The first thread shares the same virtual memory space as the application. The application library frontend interfaces with the MPIX runtime via the two unidirectional lock-free queues, interconnecting the first and second thread bidirectionally. The first thread is a thin thread that handles the synchronicity requirements for the interface without burdening the MPIX runtime. When calls are made to the MPIX runtime, the mode demultiplexer determines whether it takes a native MPIX runtime or legacy MPI runtime route.

The second thread models a proactor pattern (e.g., via ZeroMQ IPC channels) that manages interoperability between the MPIX and legacy MPI runtimes. The second thread takes messages from the application, the runtime agent 26, and the virtualization agent 24 and processes them via the command processor. Furthermore, the second thread manages system resources, converts synchronous messages to asynchronous messages, and encapsulates, serializes, and deserializes messages into and out of the agent interoperability protocol. The runtime agent 26 bridges system messages between the MPIX and MPI runtimes asynchronously. The MPIX and MPI runtimes are synchronized through two synchronous queues that feed the second thread.

2. Virtualization Agent

With continuing reference to FIG. 7, the virtualization agent 24 implements a chain of responsibility pattern, with the frontend being a proactor (e.g., a proactor enabled by ZeroMQ IPC channels). The virtualization agent 24 provides an asynchronous peer that encapsulates hardware-specific compilers, libraries, runtimes, and drivers, collectively referred to as a micro-framework. The virtualization agent 24 embodies a three-thread, three-stage pipeline, where each stage is asynchronous and interconnected via lock-free queues. The first stage is the network manager, which deserializes messages and converts them between the interconnect protocol format and an object-oriented format. The network manager also places these objects into a shared memory content store to eliminate copies and to provide a central point from which the different stages can recall messages. The lock-free queues pass around references to the shared memory content store in the form of transaction chains of transaction IDs.

The second stage is the system services, which manage requests that are resolvable without hardware intervention. The system services include stored data from the third stage as well as kernel manifest hardware specifications and runtime metrics. The third stage is the device services, where the encapsulation of the vendor logic first occurs. The device services handle device-specific details and integrate the device manager into the virtualization pipeline. Furthermore, the device services manage the kernel repository and pass kernels to the device manager along with the input payload. The device manager does all the heavy lifting, such as configuring the device, allocating device memory, loading kernels, and multiplexing (time and space) the device. The device manager performs the required configurations, data movements, and invocations applicable to the corresponding RPCs for interfacing with a runtime, framework, or library.
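The three-stage pipeline described above can be sketched, under stated assumptions, as three threads connected by queues that carry only transaction IDs into a shared content store. This is a simplified stand-in: Python's `queue.Queue` is lock-based rather than lock-free, the `"function:payload"` wire format is invented for the example, and upper-casing a string substitutes for kernel execution by the device manager.

```python
import queue
import threading

STOP = object()  # sentinel that shuts the pipeline down stage by stage


def network_manager(raw_messages, store, out_q):
    """Stage 1: convert wire messages to objects and place them in the store."""
    for txn_id, raw in enumerate(raw_messages):
        function, payload = raw.split(":", 1)
        store[txn_id] = {"function": function, "payload": payload}
        out_q.put(txn_id)  # only the transaction ID travels between stages
    out_q.put(STOP)


def system_services(store, in_q, out_q, results):
    """Stage 2: resolve requests that need no hardware; forward the rest."""
    while True:
        txn_id = in_q.get()
        if txn_id is STOP:
            out_q.put(STOP)
            return
        if store[txn_id]["function"] == "manifest":
            results[txn_id] = "kernels: upper"  # answered without the device
        else:
            out_q.put(txn_id)


def device_services(store, in_q, results):
    """Stage 3: hand the request to a simulated device manager."""
    while True:
        txn_id = in_q.get()
        if txn_id is STOP:
            return
        results[txn_id] = store[txn_id]["payload"].upper()


def run_pipeline(raw_messages):
    store, results = {}, {}
    q12, q23 = queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=network_manager, args=(raw_messages, store, q12)),
        threading.Thread(target=system_services, args=(store, q12, q23, results)),
        threading.Thread(target=device_services, args=(store, q23, results)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results
```

Passing identifiers instead of payloads mirrors the role of the content store: each stage reads the message in place, so no stage copies the input data.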

C. Multi-Agent HALO

FIG. 8 is a software stack diagram of another exemplary embodiment of the HALO framework 16. This multi-agent embodiment may provide the HALO framework 16 in an extreme-scale heterogeneous computing system comprising clusters of CPU-hosted or stand-alone accelerators. In an exemplary aspect, the C²MPI interface 12 is provided by an acceleration-as-a-service (ACaaS) application programming interface (API). Support for legacy MPI is provided by a MPI API 32. The ACaaS API provides a rich set of RPCs for applications to query the available accelerator functions and their status, claim/disclaim function-specific accelerator processes, distribute data/collect results to/from a group of function-specific accelerator processes, and so on. Through the ACaaS API, applications are able to treat a function-specific accelerator process simply as a hardware-agnostic subroutine that conceptualizes a certain type of accelerator hardware pre-programmed to execute a certain function. As such, HALO provides a unified, hardware-agnostic programming model, where an HPC application is only responsible for steering data to a selected set of function-specific subroutines without any presumption of hardware. Thus, applications for HALO implementation can include one or more accelerator data steering programs.

The HALO framework 16 includes a set of core system agents implementing a plug-and-play architecture for the purpose of scale and resiliency. The runtime agent 26 and virtualization agent 24 operate similar to the dual-agent embodiment described above. The purpose of the bridge agent 34 is to interoperate between the processor and accelerator domains. The primary responsibility of the accelerator agent 36 is to interconnect the entire accelerator domain and provide interoperability to multiple processors. Each agent is interconnected by a peer-to-peer link as well as a subscriber and publisher link for the purposes of plug and play. Agents can be dynamically established and have separate and shared memory regions. An inter-HALO API 38 establishes interoperability between processor nodes (e.g., CPUs in a CPU cluster) implementing HALO.

A virtualization-as-a-service (VaaS) API 40 provides an interface for both stand-alone and locally hosted accelerators to interact with the HALO framework 16. The VaaS API 40 provides a lean set of RPCs for an accelerator device to register the available accelerator resources (so HALO is able to identify all supported functions from available kernel/binary libraries), respond to the management calls (function programming, data management/transfer, etc.) and scheduling calls issued by HALO for executing assigned acceleration tasks, report performance monitoring data, and so on. Through the VaaS API 40, an accelerator device is able to leverage the virtualization agent 24, a device manager 42, and a hardware driver 44 of the HALO framework to transparently share accelerator hardware resources across multiple application threads, processes, and users with strict performance, fault, and memory isolations for improved system utilization, reliability, and security.

1. Runtime Agent

FIG. 9 is a software stack diagram of the runtime agent 26, illustrating an interface between the runtime agent 26 and the application 14 in the HALO embodiment of FIG. 8 . The application 14 can be run in four different configurations. Configuration #1 corresponds to a fail-safe mode where there is one process for steering data and data marshaling. Configuration #2 corresponds to a fail-safe mode with multiple processes for steering data and data marshaling. Configurations #1 and #2 operate with a legacy MPI runtime routine. Configuration #3 corresponds to a single process for steering data with data marshaling and entry into the acceleration domain. Configuration #4 corresponds to multiple processes for steering data with data marshaling and entry into the acceleration domain. Configurations #3 and #4 provide entry into the accelerator domain using the C²MPI runtime routine and the bridge agent 34. A request multiplexer 46 and mode de-multiplexer 48 make the decision which runtime routine needs to service the request.
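For illustration, the four configurations can be tabulated and a request routed accordingly. The `Configuration` type and the `mode_demultiplex` function are hypothetical names, and selecting the route from a configuration number is an assumption made for the sketch; the text does not specify how the request multiplexer 46 and mode de-multiplexer 48 reach their decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    processes: str                   # "single" or "multiple" data-steering processes
    runtime: str                     # runtime routine that services requests
    enters_accelerator_domain: bool  # True when requests continue to the bridge agent


# The four configurations as described in the text.
CONFIGURATIONS = {
    1: Configuration("single", "legacy MPI", False),
    2: Configuration("multiple", "legacy MPI", False),
    3: Configuration("single", "C2MPI", True),
    4: Configuration("multiple", "C2MPI", True),
}


def mode_demultiplex(config_id: int) -> str:
    """Return the next hop for a request under the given configuration."""
    cfg = CONFIGURATIONS[config_id]
    return "bridge agent" if cfg.enters_accelerator_domain else "legacy MPI runtime"
```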

2. Bridge Agent

FIG. 10 is a more detailed software stack diagram of the bridge agent 34 in the HALO embodiment of FIG. 8. The purpose of the bridge agent is to interoperate between the host processor and accelerator domains. HALO implements one bridge agent 34 per processor node (e.g., per CPU in a CPU cluster). The bridge agent 34 can be triggered by the runtime agent 26 (e.g., through an application call) or by the accelerator agent 36. This is critical due to the initiation process of multi-process, multi-accelerator jobs. The bridge agent 34 manages the job resources and descriptions as well as complex acceleration paths, data marshaling, routing, and rerouting. The bridge agent 34 deals with translating job-specific details to the general RPC model that is understood by the accelerator agent 36.

The bridge agent 34 implements an out-of-order execution stage within a configurable thread pool capacity. There are ingress and egress queues to accept requests from multiple processes or threads from the same application or a different application. The thread pool has a set of shared resources to store routing tables, claim tables, a pending message registry, an arbitrator, node performance monitoring tables, configuration and kernel registries, and an application data lake and results. The bridge agent 34 has first-level rerouting and provisioning delegation capabilities. The entry points also have a supervisory role over the application. The bridge agent 34 tracks patterns of accelerator usage. The bridge agent 34 can provision accelerators, create complex accelerator paths on behalf of the application, and reroute subsequent requests to new resources. It can also decommission accelerators at runtime.
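A minimal sketch of out-of-order execution within a bounded thread pool is given below, assuming that each request is independent. The `service` function, the `(request_id, delay)` request shape, and the default capacity of four workers are hypothetical; Python's `concurrent.futures` stands in for the ingress queue, thread pool, and egress queue of the bridge agent.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def service(request):
    """Stand-in for servicing one request; the delay models differing work."""
    request_id, delay = request
    time.sleep(delay)
    return request_id


def run_out_of_order(requests, capacity=4):
    """Submit all requests and collect results in completion order."""
    completed = []
    with ThreadPoolExecutor(max_workers=capacity) as pool:
        futures = [pool.submit(service, r) for r in requests]  # ingress
        for future in as_completed(futures):                   # egress
            completed.append(future.result())
    return completed
```

Results are gathered as they finish rather than in submission order, so a slow request does not hold back the responses to faster ones.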

3. Accelerator Agent Overview

FIG. 11 is a software stack diagram of the accelerator agent 36 in the HALO embodiment of FIG. 8 . The accelerator agent 36 is the most complicated module across the HALO framework 16. The primary responsibility of the accelerator agent 36 is to interconnect the entire accelerator domain and provide interoperability to multiple accelerator nodes and processor nodes.

HALO implements a microservice-based distributed plug-and-play system where entry points and accelerator agents 36 of different (or similar) implementations can be interconnected dynamically. The accelerator agent 36 module contains a discovery protocol to identify the different modules that are connected, as well as a heartbeat mechanism to detect whether or not a core node is operational. HALO implements a dynamic routing table to support resiliency in the case that an accelerator endpoint has multiple paths, specifically for a stand-alone accelerator that implements the SAP on the network.

The accelerator agent 36 has a set of caches that monitor processor node performance across the entire job, and dynamically adjusts the path based on congestion protocols. An additional cache may be provided for caching routes to subroutine resources from a remote accelerator agent 36. The accelerator agent 36 also implements a recommendation engine that intelligently assigns a child node to the entry point based on specific recommendation strategies defined by the configuration file. There are two strategies to implement: scattered and compact. Scattered implies that within a set of heterogeneous accelerators that implement the same subroutine, HALO can assign a subroutine from an accelerator node one at a time, and round-robin across all the accelerator nodes until the request is complete. The compact strategy fills a provision request on the same accelerator node until the number of virtual slots on the accelerator node is filled.
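The two recommendation strategies can be sketched as follows, assuming each accelerator node advertises a number of free virtual slots. The function names, the dictionary of slots, and the node names used in the usage note are hypothetical and serve only to contrast the two assignment orders.

```python
from typing import Dict, List


def scattered(slots: Dict[str, int], requested: int) -> List[str]:
    """Assign one subroutine at a time, round-robin across accelerator nodes."""
    free = dict(slots)  # work on a copy so the caller's table is unchanged
    assignment: List[str] = []
    while len(assignment) < requested:
        progressed = False
        for node in free:
            if len(assignment) == requested:
                break
            if free[node] > 0:
                free[node] -= 1
                assignment.append(node)
                progressed = True
        if not progressed:
            raise ValueError("not enough free virtual slots")
    return assignment


def compact(slots: Dict[str, int], requested: int) -> List[str]:
    """Fill one accelerator node's virtual slots before moving to the next."""
    assignment: List[str] = []
    for node, capacity in slots.items():
        take = min(capacity, requested - len(assignment))
        assignment.extend([node] * take)
    if len(assignment) < requested:
        raise ValueError("not enough free virtual slots")
    return assignment
```

With slots `{"fpga0": 2, "fpga1": 2, "gpu0": 1}` and a request for three subroutines, the scattered strategy touches all three nodes once, whereas the compact strategy fills `fpga0` before taking one slot on `fpga1`.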

4. Virtualization Agent Overview

FIG. 12 is a more detailed software stack diagram of the virtualization agent 24 in the HALO embodiment of FIG. 8 . The virtualization agent 24 and subcomponents therein can use hardware-specific logic to implement the virtualization and the hardware interface to the hardware runtime or drivers. The virtualization agent 24 includes a three-stage pipeline, which includes a HALO-to-accelerator interface. Stage 1 of the virtualization pipeline is a system services stage. This stage handles requests unrelated to the device (or cached device data), such as queueing, fusing, partitioning, merging, and caching subroutine meta-data, and subroutines availability.

In stage 2, device services are the main driving stage for device status updates and for the flow of control data into the device manager, which is the main submodule for hardware-agnostic activities, including submitting routines to the hardware driver 44. Stage 2 mainly brings data out of the device memory and places it into the shared memory of the virtualization agent 24, where it stays until all the dependencies (partitions) of the routine are complete. The control pipeline between the stages then fuses the data and moves it back to the accelerator agent 36, to be returned to the entry point.

Stage 3 is the hardware-aware domain. In stage 3, the device manager 42 is isolated and needs to meet the stage 2 API control concept. In this stage, internals handle partitioning logic, device discovery, subroutine interpretation, and hardware filtering.

5. HALO Interoperability

FIG. 13 is a diagram illustrating interoperability among a plurality of nodes implementing HALO. The processor domain entry point inherits application manager and request buffering. The processor domain is mainly to interconnect the processor nodes to accelerator nodes, and vice-versa. It also provides the internal synchronization primitives for sending and receiving data to and from the application. HALO mainly implements and extends functionality from the MPI standard to communicate collectively to other processor nodes for the purpose of synchronization and data movement and buffering across processor nodes.

IV. EVALUATION

A. Evaluation Setup

This section evaluates the performance of four different types of implementations for eight widely used HPC kernels based on three different types of computing devices (i.e., CPU, GPU, FPGA). The four implementation types are hardware-optimized baseline implementations using the best available vendor-suggested libraries or frameworks, hardware-specific and hardware-agnostic implementations using OpenCL, and hardware-agnostic implementations based on HALO.

It should be understood that this evaluation was made on a dual-agent embodiment on a single-node heterogeneous computing system, and the evaluation focuses on performance portability. Multi-agent embodiments of HALO enable other benefits for extreme-scale heterogeneous computing systems, such as performance scalability, productivity, and resilience, that are not reflected here.

The eight HPC kernels are matrix-matrix multiplication (MMM), element-wise matrix multiplication (EWMM), sparse matrix-matrix multiplication (SMMM), matrix-vector multiplication (MVM), element-wise matrix division (EWMD), vector dot-product (VDP), Jacobi solver (JS), and one-dimensional convolution (1DConv). These kernels represent a set of core computational workloads prevalent in multiple HPC disciplines, such as computational fluid dynamics, computational material science, weather forecasting, and deep learning. Although these kernels are relatively simple, such complexity is sufficient to evaluate the orchestration performance of HALO, as the mechanism of orchestration is independent of the kernel definition. The kernels are evaluated on a HALO test harness configured with 1 parent and claiming 1 child rank, associated with a single accelerator.

The performance metrics used for evaluation are HALO overhead (T1) between the runtime and virtualization agent (IPC with ZMQ), hardware overhead (T2), kernel execution time (T3), and the total runtime (T4=T1+T2+T3). The HALO overhead is the round-trip time for sending and receiving inputs and outputs to/from the virtualization agent, which is the overhead cost imposed by HALO. The HALO overhead depends on the total size of input and output arguments and is independent of kernel execution time. The hardware overhead (T2) refers to the data transfer time between the host and accelerator device memories.

The evaluations are performed on two server nodes. The GPU node runs on Ubuntu 18.04 and is equipped with 64 GB of DRAM and dual sockets of Intel Xeon E5-2620 v4 CPU, each hosting two NVIDIA GeForce RTX 2080 Ti GPUs. The FPGA node runs on Red Hat 7 and is equipped with 32 GB of DRAM and an Intel Xeon E3-1275 v5 CPU hosting two BittWare 385A FPGA accelerators (Intel Arria 10 GX1150 FPGA). The CPUs used for evaluation are the ones on the GPU node. NVIDIA driver 440.100 and CUDA toolkit 10.2 are used for the GPUs. Intel FPGA SDK for OpenCL with Quartus Prime Pro 19.1 is used for the FPGAs. The SYCL DPC++ beta compiler from DPC++ daily 2020-07-23 is used.

The CPU-optimized baseline implementations leverage Intel Math Kernel Library (MKL) and handwritten C++. The GPU-optimized baseline implementations leverage a combination of Thrust, NVIDIA libraries, and handwritten C++ (including CUDA). The FPGA-optimized baseline implementations are based on OpenCL 1.1 with high-level synthesis. The host programs are compiled with GCC 7.5 for each platform, and HALO is compiled with GCC 10.1.

The hardware-specific OpenCL implementations leverage standard OpenCL with hardware-specific optimization, such as SIMD width, compiler flag optimization, memory coalescing, channels extension, and other hardware-specific attribute optimization. The hardware-agnostic OpenCL implementations remove such hardware-specific optimization in the host or device code. It should be noted that these hardware-optimized baselines represent the highest performing implementations. These baselines are used to demonstrate that HALO is able to maintain the highest performance achievable by hardware-optimized implementations while enabling a hardware-agnostic host program. As the existing portability frameworks or heterogeneous programming languages fail to provide truly hardware-agnostic programming interfaces or host programs, they are not considered as counterparts for comparison.

FIG. 14 illustrates a template of a host application source code (e.g., an application data steering program) of a HALO implementation used for the evaluation. The hardware-agnostic HALO implementations leverage HALO configured with a runtime agent and five virtualization agents (e.g., using the embodiment of FIG. 5). The virtualization agents support the virtualization of Intel CPU-based libraries, Intel CPU OpenCL/SYCL, NVIDIA CUDA/Thrust, NVIDIA GPU OpenCL, and Intel FPGA OpenCL runtimes. Taking advantage of hardware agnosticism and transparent interoperability, HALO can always leverage hardware-optimized baseline implementations to accelerate the hardware-agnostic HALO implementations.

To reveal the performance degradation of transitioning from baseline to OpenCL implementations, a metric of performance penalty (%) is defined as (T3_(OpenCL) − T3_(Baseline))/T3_(Baseline) × 100. To compare the performance portability between the two hardware-agnostic solutions, a metric of performance portability score is defined as T3_(Baseline)/T3_(Hardware-agnostic). The performance portability score ranges from 0 to 1 and quantifies the ability of a hardware-agnostic implementation to maintain a high performance (low kernel execution time) relative to the hardware-optimized implementation across different computing devices. To reveal the impact of HALO overhead on the total runtime, a metric of HALO overhead ratio is defined as T1_(HALO)/T4_(HALO).
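The three metrics can be written out directly, as in the following sketch. The function names are illustrative. The helper `score_from_penalty` is not defined in the text; it follows algebraically from the two definitions when the same hardware-agnostic OpenCL timing is used in both, since T3_(OpenCL) = T3_(Baseline) × (1 + penalty/100).

```python
def performance_penalty(t3_opencl: float, t3_baseline: float) -> float:
    """Performance penalty in percent: (T3_OpenCL - T3_Baseline) / T3_Baseline * 100."""
    return (t3_opencl - t3_baseline) / t3_baseline * 100.0


def portability_score(t3_baseline: float, t3_hardware_agnostic: float) -> float:
    """Performance portability score: T3_Baseline / T3_Hardware-agnostic."""
    return t3_baseline / t3_hardware_agnostic


def overhead_ratio(t1: float, t4: float) -> float:
    """HALO overhead ratio: T1 / T4, where T4 = T1 + T2 + T3."""
    return t1 / t4


def score_from_penalty(penalty_percent: float) -> float:
    """Derived relation: score = 1 / (1 + penalty / 100)."""
    return 1.0 / (1.0 + penalty_percent / 100.0)
```

For example, with a hypothetical baseline time of 10 ms and an OpenCL time of 25 ms, the penalty is 150% and the score is 0.4. The same relation links the reported figures: a penalty of 1,892% corresponds to a score of about 0.05, and a T1 of 76 ms against a T4 of 438 ms gives an overhead ratio of about 17%.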

The evaluations are conducted using working set sizes (WSS), ranging from 48 MB to 1 GB. It should be noted that T1, T2, and T3 all increase near-linearly as WSS scales up. As a result, it is observed from the evaluations that the performance penalty, performance portability score, and HALO overhead ratio are all invariant to WSS when WSS is sufficiently large. Therefore, the WSS range used is sufficient to project the evaluation results for larger WSS as well.

B. Evaluation Results

Table 2 shows the performance penalty of hardware-specific and hardware-agnostic OpenCL implementations. For the CPU and GPU, the hardware-specific OpenCL implementations suffer from a performance penalty of 0%-484% and 0%-2,430%, respectively, with the majority achieving <63% performance penalty over the hardware-optimized baselines.

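TABLE 2 Performance penalty (%) of hardware-specific (HS) and hardware-agnostic (HA) OpenCL implementations

| Kernel Name | HS-OpenCL CPU | HS-OpenCL GPU | HS-OpenCL FPGA | HA-OpenCL CPU | HA-OpenCL GPU | HA-OpenCL FPGA |
| --- | --- | --- | --- | --- | --- | --- |
| MMM | 204% | 47% | 0% | 1,892% | 2,865% | 246,479% |
| EWMM | 54% | 58% | 0% | 162% | 131% | 5.6% |
| SMMM | 484% | 0% | 0% | 4,491% | 897% | 9,778% |
| EWMD | 6.2% | 46% | 0% | 60% | 129% | 1.4% |
| VDP | 0% | 3.9% | 0% | 349% | 78% | 1,157% |
| JS | 22% | 2,430% | 0% | 215% | 3,233% | 1,440% |
| MVM | 63% | 400% | 0% | 357% | 376,300% | 3,214% |
| 1DConv | 51% | 14% | 0% | 182% | 738,657% | 58,182% |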

The performance impact after removing hardware-specific optimization paints a grimmer picture. For the CPU, GPU, and FPGA, the hardware-agnostic OpenCL implementations suffer from a much bigger performance penalty of 60%-4,491%, 78%-738,657%, and 1.4%-246,479%, respectively. Such a large variance in performance penalty makes OpenCL hardly a practical solution to hardware-agnostic programming for the future of heterogeneous computing for HPC or other large-scale systems.

Although not represented, the best baseline performance for FPGAs should be from RTL-based implementations. Due to time constraints, the hardware-specific OpenCL implementations are also used to represent the FPGA-optimized baseline; hence no degradation is reported in the FPGA column. However, FPGAs suffer the most from the transition from a baseline to a hardware-agnostic OpenCL implementation. The reason is that the implementations fail to exploit spatial or temporal parallelism in FPGAs without hardware-specific optimization. Similarly, because none of the three accelerators evaluated has a just-in-time (JIT) compiler, runtime recompilation and the associated compiler flags for optimization purposes are unavailable to the hardware-agnostic implementations, causing further performance drops.

Table 3 shows the performance portability score comparison between the HALO and hardware-agnostic OpenCL implementations. The hardware-specific OpenCL implementations are not portable at all in this definition, thus have no performance portability score. Because HALO adds a firm level of abstraction and features transparent interoperability and extensibility, the virtualization agents can always leverage hardware-optimized baseline implementations to accelerate the hardware-agnostic application codes. With the abstraction layer provided by HALO, all hardware-specific implementation details that are critical for performance can be hidden from the hardware-agnostic applications while still assuring the same performance as the hardware-optimized baselines across the board. Therefore, the HALO implementations consistently achieve the maximum performance portability score of 1.0 across all the kernels and devices.

TABLE 3 Performance portability score comparison between HALO and hardware-agnostic OpenCL implementations. Values in parentheses give the ratio of the HALO score to the HA-OpenCL score.

| Kernel Name | HALO CPU | HALO GPU | HALO FPGA | HA-OpenCL CPU | HA-OpenCL GPU | HA-OpenCL FPGA |
| --- | --- | --- | --- | --- | --- | --- |
| MMM | 1.00 (20x) | 1.00 (30x) | 1.00 (2,466x) | 0.05 | 0.03 | 4.06e−4 |
| EWMM | 1.00 (3x) | 1.00 (2x) | 1.00 (4x) | 0.38 | 0.43 | 0.23 |
| SMMM | 1.00 (46x) | 1.00 (10x) | 1.00 (99x) | 0.02 | 0.10 | 0.01 |
| EWMD | 1.00 (2x) | 1.00 (2x) | 1.00 (4x) | 0.62 | 0.44 | 0.27 |
| VDP | 1.00 (4x) | 1.00 (2x) | 1.00 (13x) | 0.22 | 0.54 | 0.08 |
| JS | 1.00 (3x) | 1.00 (33x) | 1.00 (15x) | 0.32 | 0.03 | 0.06 |
| MVM | 1.00 (5x) | 1.00 (94,100x) | 1.00 (33x) | 0.22 | 1.10e−5 | 0.03 |
| 1DConv | 1.00 (3x) | 1.00 (861,883x) | 1.00 (581x) | 0.36 | 1.20e−6 | 1.72e−3 |

Due to the large performance penalty, the hardware-agnostic OpenCL implementations suffer from a low performance portability score ranging from 1.2e-6 to 0.62, indicating unstable and poor performance portability. To justify a practical solution to hardware-agnostic programming for future heterogeneous computing systems, an average performance portability score of at least 0.95, indicating true performance portability, is needed.

Table 4 shows the software overhead ratio of HALO. The overhead ratio ranges from 17% to 87%, 38% to 56%, and 1.6% to 60% for the CPU, GPU, and FPGA, respectively, with the smallest overhead belonging to the matrix-matrix multiplication kernel for all platforms. This is due to its high computational complexity of O(n³). In addition, the HALO overhead proportionately increases from 31 ms to 409 ms as the input payload increases from 48 MB to 1 GB. Note that HALO is far from optimal. The HALO overhead is directly attributed to the lack of a shared memory model across the boundary of different HALO system agents. Further embodiments mitigate the overhead by implementing an asynchronous partitioned global address space to store input and output payloads.

TABLE 4 Software overhead of HALO

| Kernel Name | T1 (ms) | T4 CPU (ms) | T4 GPU (ms) | T4 FPGA (ms) | T1/T4 CPU | T1/T4 GPU | T1/T4 FPGA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SMMM | 31.00 | 74.00 | 81.00 | 1,874.00 | 42% | 38% | 1.6% |
| MMM | 76.00 | 438.00 | 200.00 | 752.00 | 17% | 38% | 10% |
| VDP | 102.00 | 336.01 | 253.13 | 233.00 | 30% | 40% | 44% |
| EWMM | 306.00 | 501.63 | 559.61 | 1,034.57 | 61% | 55% | 30% |
| EWMD | 306.00 | 611.78 | 563.56 | 1,024.34 | 50% | 54% | 30% |
| JS | 402.00 | 1,354.48 | 722.88 | 4,514.00 | 30% | 56% | 8.9% |
| 1DConv | 402.00 | 2,056.00 | 849.00 | 892.00 | 20% | 47% | 45% |
| MVM | 409.00 | 472.00 | 727.00 | 683.00 | 87% | 56% | 60% |

V. FLOW DIAGRAM

FIG. 15 is a flow diagram of a method for providing HALO in a heterogeneous computing system (e.g., an HPC, data center, or edge computing system). Dashed boxes represent optional steps. The process begins at operation 1500, with receiving a hardware-agnostic code comprising a request to execute a computational function. The process optionally continues at operation 1502, with selecting a hardware accelerator best suited for executing the computational function. The process continues at operation 1504, with determining that the hardware accelerator is available to execute the computational function. The process continues at operation 1506, with interfacing with the hardware accelerator to execute the computational function.
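The operations of FIG. 15 can be sketched in a few lines, assuming a simple in-memory model of the accelerators. The `Accelerator` type, its `preference` field (a hypothetical suitability score used for operation 1502), the request dictionary, and the `dot` kernel are all illustrative and are not part of the disclosed interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Accelerator:
    name: str
    kernels: Dict[str, Callable[..., object]]  # function name -> callable kernel
    available: bool = True
    preference: int = 0  # hypothetical suitability score; higher is better


def dot(args):
    """Example kernel: vector dot-product of two equal-length sequences."""
    x, y = args
    return sum(a * b for a, b in zip(x, y))


def orchestrate(request: dict, accelerators: List[Accelerator]) -> Tuple[str, object]:
    # Operation 1500: receive hardware-agnostic code requesting a function.
    function, args = request["function"], request["args"]

    # Operation 1502 (optional): rank the accelerators that implement the function.
    candidates = sorted(
        (a for a in accelerators if function in a.kernels),
        key=lambda a: a.preference,
        reverse=True,
    )

    for acc in candidates:
        # Operation 1504: determine that the accelerator is available.
        if acc.available:
            # Operation 1506: interface with the accelerator to execute.
            return acc.name, acc.kernels[function](args)
    raise RuntimeError(f"no available accelerator implements {function!r}")


gpu = Accelerator("gpu0", {"vdp": dot}, available=False, preference=2)
cpu = Accelerator("cpu0", {"vdp": dot}, preference=1)
```

In this sketch the preferred accelerator `gpu0` is busy, so the same hardware-agnostic request is serviced by `cpu0` without any change on the application side.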

Although the operations of FIG. 15 are illustrated in a series, this is for illustrative purposes and the operations are not necessarily order dependent. Some operations may be performed in a different order than that presented. Further, processes within the scope of this disclosure may include fewer or more steps than those illustrated in FIG. 15 .

VI. COMPUTER SYSTEM

FIG. 16 is a block diagram of a computer system 1600 using HALO according to embodiments disclosed herein. The computer system 1600 can be implemented as a heterogeneous computing system. The computer system 1600 comprises any computing or electronic device capable of including firmware, hardware, and/or executing software instructions that could be used to perform any of the methods or functions described above, such as providing HALO. In this regard, the computer system 1600 may be a circuit or circuits included in an electronic board card, such as a printed circuit board (PCB), a server, a personal computer, a desktop computer, a laptop computer, an array of computers, a personal digital assistant (PDA), a computing pad, a mobile device, or any other device, and may represent, for example, a server or a user's computer.

The exemplary computer system 1600 in this embodiment includes a processing device 1602 or processor, a system memory 1604, and a system bus 1606. The system memory 1604 may include non-volatile memory 1608 and volatile memory 1610. The non-volatile memory 1608 may include read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The volatile memory 1610 generally includes random-access memory (RAM) (e.g., dynamic random-access memory (DRAM), such as synchronous DRAM (SDRAM)). A basic input/output system (BIOS) 1612 may be stored in the non-volatile memory 1608 and can include the basic routines that help to transfer information between elements within the computer system 1600.

The system bus 1606 provides an interface for system components including, but not limited to, the system memory 1604 and the processing device 1602. The system bus 1606 may be any of several types of bus structures that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and/or a local bus using any of a variety of commercially available bus architectures.

The processing device 1602 represents one or more commercially available or proprietary general-purpose processing devices, such as a microprocessor, central processing unit (CPU), or the like. More particularly, the processing device 1602 may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or other processors implementing a combination of instruction sets. Examples of the processing device 1602 may include a host CPU node, a CPU cluster, an FPGA or FPGA cluster, a GPU or GPU cluster, or a TPU or TPU cluster. The processing device 1602 may also be an application-specific integrated circuit (ASIC), for example. The processing device 1602 is configured to execute processing logic instructions for performing the operations and steps discussed herein.

In this regard, the various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with the processing device 1602, which may be a microprocessor, FPGA, a digital signal processor (DSP), an ASIC, or other programmable logic device, a discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. Furthermore, the processing device 1602 may be a microprocessor, or may be any conventional processor, controller, microcontroller, or state machine. The processing device 1602 may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).

The computer system 1600 may further include or be coupled to a non-transitory computer-readable storage medium, such as a storage device 1614, which may represent an internal or external hard disk drive (HDD), flash memory, or the like. The storage device 1614 and other drives associated with computer-readable media and computer-usable media may provide non-volatile storage of data, data structures, computer-executable instructions, and the like. Although the description of computer-readable media above refers to an HDD, it should be appreciated that other types of media that are readable by a computer, such as optical disks, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the operating environment, and, further, that any such media may contain computer-executable instructions for performing novel methods of the disclosed embodiments.

An operating system 1616 and any number of program modules 1618 or other applications can be stored in the volatile memory 1610, wherein the program modules 1618 represent a wide array of computer-executable instructions corresponding to programs, applications, functions, and the like that may implement the functionality described herein in whole or in part, such as through instructions 1620 on the processing device 1602. The program modules 1618 may also reside on the storage mechanism provided by the storage device 1614. As such, all or a portion of the functionality described herein may be implemented as a computer program product stored on a transitory or non-transitory computer-usable or computer-readable storage medium, such as the storage device 1614, non-volatile memory 1608, volatile memory 1610, instructions 1620, and the like. The computer program product includes complex programming instructions, such as complex computer-readable program code, to cause the processing device 1602 to carry out the steps necessary to implement the functions described herein.

An operator, such as the user, may also be able to enter one or more configuration commands to the computer system 1600 through a keyboard, a pointing device such as a mouse, or a touch-sensitive surface, such as the display device, via an input device interface 1622 or remotely through a web interface, terminal program, or the like via a communication interface 1624. The communication interface 1624 may be wired or wireless and facilitate communications with any number of devices via a communications network in a direct or indirect fashion. An output device, such as a display device, can be coupled to the system bus 1606 and driven by a video port 1626. Additional inputs and outputs to the computer system 1600 may be provided through the system bus 1606 as appropriate to implement embodiments described herein.

The operational steps described in any of the exemplary embodiments herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary embodiments may be combined.

Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein and the claims that follow. 

What is claimed is:
 1. A hardware-agnostic accelerator orchestration (HALO) framework in a heterogeneous computing system, the HALO framework comprising: a runtime agent configured to implement an application interface for receiving a hardware-agnostic request to execute a computational function; and a virtualization agent configured to implement an accelerator interface for providing a hardware-specific instruction for a hardware accelerator to execute the computational function.
 2. The HALO framework of claim 1, further comprising a plurality of virtualization agents, each of the plurality of virtualization agents providing hardware-specific instructions for a different accelerator specification.
 3. The HALO framework of claim 1, wherein the runtime agent and the virtualization agent run asynchronously from each other.
 4. The HALO framework of claim 1, wherein the runtime agent comprises: a compute-centric message passing interface (C²MPI) application interface as the application interface; and a C²MPI accelerator interface configured to select a hardware accelerator specification suitable to execute the computational function.
 5. The HALO framework of claim 4, wherein the virtualization agent is further configured to monitor and manage available hardware accelerator resources.
 6. The HALO framework of claim 1, further comprising a bridge agent configured to manage data routing between the runtime agent and the virtualization agent.
 7. The HALO framework of claim 6, further comprising an accelerator agent configured to select a hardware accelerator specification suited to execute the computational function.
 8. The HALO framework of claim 7, wherein the bridge agent can be triggered by the runtime agent or the accelerator agent.
 9. The HALO framework of claim 1, wherein the virtualization agent comprises a hardware manager configured to manage access to a plurality of remote accelerators.
 10. The HALO framework of claim 1, wherein the HALO framework is implemented using a central processing unit (CPU) in the heterogeneous computing system.
 11. The HALO framework of claim 10, further comprising an inter-HALO application programming interface (API) configured to coordinate data movement with another HALO instance implemented using another CPU in the heterogeneous computing system.
 12. A method for providing hardware-agnostic accelerator orchestration (HALO) in a heterogeneous computing system, the method comprising: receiving a hardware-agnostic code comprising a request to execute a computational function; determining that a hardware accelerator is available to execute the computational function; and interfacing with the hardware accelerator to execute the computational function.
 13. The method of claim 12, further comprising selecting the hardware accelerator best suited for executing the computational function.
 14. The method of claim 13, wherein selecting the hardware accelerator best suited for executing the computational function is based on a rank of the computational function.
 15. The method of claim 14, wherein the hardware-agnostic code defines a host process having a parent rank and the request to execute the computational function has a child rank.
 16. The method of claim 13, wherein selecting the hardware accelerator best suited for executing the computational function is based on performance of the hardware accelerator.
 17. The method of claim 12, further comprising selecting the hardware accelerator based on a received configuration.
 18. The method of claim 12, wherein determining that the hardware accelerator is available to execute the computational function comprises selecting a group of accelerators suitable for accelerating a plurality of computational functions in a host application.
 19. The method of claim 12, wherein the heterogeneous computing system is a high-performance computing (HPC) system.
 20. The method of claim 12, wherein the heterogeneous computing system is an edge computing platform.
 21. The method of claim 12, wherein the heterogeneous computing system is a data center.
 22. A non-transitory computer-readable medium having stored thereon software instructions that, when executed by a processor, cause the processor to: launch a hardware-agnostic accelerator orchestration (HALO) framework comprising: a runtime agent implementing an application interface to receive a hardware-agnostic request to execute a computational function; and a virtualization agent implementing an accelerator interface to provide a hardware-specific instruction for a hardware accelerator to execute the computational function. 